Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing

Abstract

Nested (fork-join) parallelism eases parallel programming by enabling high-level expression of parallelism and leaving the mapping between parallel tasks and hardware to the runtime scheduler. A challenge in dynamic scheduling of nested parallelism is how to exploit data locality, which has become more demanding in the deep cache hierarchies of modern processors with a large number of cores. This paper introduces almost deterministic work stealing (ADWS), which efficiently exploits data locality by deterministically planning a cache-hierarchy-aware schedule, while allowing a little scheduling variety to facilitate dynamic load balancing. Furthermore, we propose an extension of our prior work on ADWS to achieve better shared cache utilization. The improved version of the scheduler is called multi-level ADWS. The idea is that only a part of the computation whose working set size is small enough to fit into a shared cache is scheduled within the cache recursively, thus avoiding excessive capacity misses. Our evaluation on a benchmark of decision tree construction demonstrated that multi-level ADWS outperformed the conventional random work stealing of Cilk Plus by 61%, and it showed a 40% performance improvement over the previous ADWS design.
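
The placement rule behind the multi-level idea can be illustrated with a small sketch. Everything in it is hypothetical: the `Task` tree, the `working_set` byte estimates, the `CACHE_LEVELS` capacities, and the `plan` function are assumptions made for illustration, not the paper's data structures or the ADWS runtime interface. It shows only one thing: while descending the task tree, a subtree is bound to a shared cache as soon as its working set fits, and its descendants are then placed within that cache recursively.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical shared-cache capacities in bytes, largest first (assumed values).
CACHE_LEVELS: List[Tuple[str, int]] = [("L3", 32 * 2**20), ("L2", 1 * 2**20)]


@dataclass
class Task:
    name: str
    working_set: int  # assumed hint: estimated bytes touched by this subtree
    children: List["Task"] = field(default_factory=list)


def plan(task: Task, level: int = 0, bound_to: str = "machine") -> List[Tuple[str, str]]:
    """Return (task name, cache the subtree is scheduled within) pairs."""
    # Bind to progressively smaller caches for as long as the working set fits.
    while level < len(CACHE_LEVELS) and task.working_set <= CACHE_LEVELS[level][1]:
        bound_to = CACHE_LEVELS[level][0]
        level += 1
    placement = [(task.name, bound_to)]
    # Children inherit the binding and may narrow it further.
    for child in task.children:
        placement += plan(child, level, bound_to)
    return placement


if __name__ == "__main__":
    root = Task("root", 64 * 2**20, [
        Task("a", 16 * 2**20, [Task("a1", 512 * 2**10)]),
        Task("b", 48 * 2**20),
    ])
    for name, cache in plan(root):
        print(name, "->", cache)
```

In this toy tree, `root` and `b` exceed the assumed L3 capacity and stay unbound ("machine"), `a` is bound to L3, and `a1` is bound to L2. A real scheduler would also have to decide which workers share each cache and how to balance load among them, which this sketch leaves out.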

Similar resources

Improving Cache Memory Utilization

In this paper, an efficient technique is proposed to manage the cache memory. The proposed technique modifies the well-known set-associative mapping technique. The modification requires a small alteration to the structure of the cache memory and to the way it is referenced. The proposed alteration virtually increases the set size and consequently t...

A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

Reordering instructions and data layout can bring significant performance improvement for memory bounded applications. Parallelizing such applications requires a careful design of the algorithm in order to keep the locality of the sequential execution. In this paper, we aim at finding a good parallelization of memory bounded applications on multicore that preserves the advantage of a shared cac...

Work stealing for GPU-accelerated parallel programs in a global address space framework

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parall...

Improving Memory Utilization in Cache Coherence Directories

Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significan...

Piecewise execution of nested data-parallel programs

The technique of flattening nested data parallelism combines all the independent operations in nested apply-to-all constructs and generates large amounts of potential parallelism for both regular and irregular expressions. However, the resulting data-parallel programs can have enormous memory requirements, limiting their utility. In this paper, we present piecewise execution, an automatic metho...

Journal

Journal title: IEEE Transactions on Parallel and Distributed Systems

Year: 2022

ISSN: 1045-9219, 1558-2183, 2161-9883

DOI: https://doi.org/10.1109/tpds.2022.3196192